HW 01

Author

Usama Ahmed

0 - Setup

Code
if (!require("pacman")) 
  install.packages("pacman")
Loading required package: pacman
Code
# use this line for installing/loading
pacman::p_load(tidyverse,
               gridExtra,
               grid,
               gtable,
               ggpubr,
               ggmap,
               ggrepel,
               patchwork,
               units,
               data.table,
               devtools,
               openintro,
               tidytuesdayR,
               here,
               ggpubr) 

ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

options(width = 65)
 
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618,
  fig.retina = 3,
  fig.align = "center",
  dpi = 300,
  warning = FALSE,
  message = FALSE
)

1 - Road traffic accidents in Edinburgh

Code
accidents <- read_csv(here("data","accidents.csv"))

accidents$weekend.or.weekday <- 
  ifelse(!(accidents$day_of_week %in% c('Saturday','Sunday')),'Weekday','Weekend')

ggplot(accidents, aes(x =  time, fill = severity)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~weekend.or.weekday, dir = "v") + 
  scale_fill_viridis_d() +
  labs(x = "Time of day",
       y = "Density",
       title = "Number of accidents throughout the day", 
       subtitle = "By day of week and severity",
       fill = "Severity")

On weekdays, slight and serious injuries take place most frequently between 4pm to 6pm, whereas most of the fatal injuries take place between 10am to 12pm. Another peak time for slight and serious injuries is between 8am till 10am on weekdays. We can also observe that most of the serious and slight injuries take place at the same time. On weekends, no fatal injuries took place, and the distribution of slight and serious injuries is left skewed, i.e. most of them take place after 12pm

2 - NYC marathon winners

Code
nyc_marathon <- read_csv(here("data","nyc_marathon.csv"))

# a)

nyc_marathon |>
  ggplot(aes(x = time_hrs)) + 
  geom_histogram(fill = "purple",
                 color = "black",
                 alpha = 0.6) +
  labs(x = "Time",
       y = "Frequency",
       title = "Histogram of Marathon Completion Time",
       subtitle = "From a group of 69 participants")

Code
nyc_marathon |>
  ggplot(aes(x = 1, y = time_hrs)) + 
  geom_boxplot(fill = "purple",
               color = "black") +
  labs(x = NULL,
       y = "Time",
       title = "Boxplot of Marathon Completion Time",
       subtitle = "From a group of 69 participants")

Code
# b)

nyc_marathon |>
  ggplot(aes(x = division, y = time_hrs, fill = division)) + 
  geom_boxplot() +
  scale_fill_manual(values = c("skyblue","pink")) +
  labs(x = "Division",
       y = "Time",
       title = "Boxplot of Marathon Completion Time",
       fill = NULL) 

Code
nyc_marathon |>
  ggplot(aes(x = division, y = time_hrs, fill = division)) + 
  geom_boxplot() +
  scale_fill_manual(values = c("skyblue","pink")) +
  theme(legend.position = "none") +
  labs(x = "Division",
       y = "Time",
       title = "Boxplot of Marathon Completion Time",
       fill = NULL) 

Code
# d)

nyc_marathon |> 
  filter(!is.na(time_hrs)) |>
  group_by(division,year) |>
  summarise(avg_time = round(mean(time_hrs),2), .groups = "drop") |>
  ggplot(aes(x = year, 
             y = avg_time,
             color = division,
             shape = division)) + 
  geom_point(size = 2.5) +
  geom_line(size = 0.8) +
  scale_color_manual(values = c("skyblue","pink")) +
    theme(
    legend.position = c(0.9,0.9),
    legend.box.background = element_rect(fill = "white",
                                         color = "white"),
    plot.subtitle = element_text(color = "darkgrey"),
   ) +
  labs(x = "Year",
       y = "Time",
       title = "Average marathon completion time by gender",
       subtitle = "From 1970 till 2020",
       color = NULL,
       shape = NULL) 

a - We can see the summary statistics i.e. median, quartiles, and outliers which are apparent in box plot but not in histogram. For histogram, we can see the distribution of time is right-skewed i.e. more people took < 2 hours and 20 mins to complete the marathon. It is not apparent in box plot.
b - Women took more time to complete the marathon then men,i.e. the interquartile range and median time is less for men than women. Moreover, the variability in data is less for men as compared to women i.e. more compact box and less spread of outliers.
c - The legend is redundant. We can see the segments on x-axis . Thus removing the legend impacts the data-to-ink ratio directly because the same information can be depicted with the use of less ink, hence improving the data-to-ink ratio.
d - In this plot, we can see the trend of average marathon times for men and women over the years. There has been a decline in the average time to complete the marathon for both men and women between 1970 to 1980s. The decline is approximately an hour for women and 30 minutes for men

3 - US counties

Code
data(county)

# Plot A

plot_a <- ggplot(county, aes(x = homeownership, y = poverty)) +
  geom_point() + 
  labs(title = "Plot A") +
  theme_gray()

# Plot B

plot_b <- ggplot(county, aes(x = homeownership, y = poverty)) +
  geom_point() + 
  geom_smooth(se = FALSE) +
  labs(title = "Plot B") +
  theme_gray()

# Plot C

plot_c <- ggplot(county, aes(x = homeownership, y = poverty)) +
  geom_point() + 
  geom_smooth(aes(group = metro),color = "green",se = FALSE) + 
  labs(title = "Plot C") +
  theme_gray()

# Plot D

plot_d <- ggplot(county, aes(x = homeownership, y = poverty)) +
  geom_smooth(aes(group = metro),color = "blue",se = FALSE) +
  geom_point() +
  labs(title = "Plot D") +
  theme_gray()
  

  # Plot E

plot_e <- ggplot(county, aes(x = homeownership, y = poverty)) +
  geom_point(aes(color = metro)) +
  geom_smooth(color = "blue",
              se = FALSE,
              aes(group = metro,
              linetype = metro)) +
  labs(title = "Plot E") +
  guides(color = guide_legend(order = 2),
         linetype = guide_legend(order = 1)) +
  theme_gray()

 # Plot F

plot_f <- ggplot(county, aes(x = homeownership, y = poverty)) +
  geom_smooth(aes(group = metro, color = metro),se = FALSE) +
  geom_point(aes(color = metro)) +
  labs(title = "Plot F") +
  theme_gray()

# Plot G

plot_g <- ggplot(county, aes(x = homeownership, y = poverty)) +
  geom_point(aes(color = metro)) + 
  geom_smooth(se = FALSE, color = "blue") +
  labs(title = "Plot G") +
  theme_gray()
    
# Plot H

plot_h <- ggplot(county, aes(x = homeownership, y = poverty)) +
  geom_point(aes(color = metro)) + 
  labs(title = "Plot H") +
  theme_gray()

# Making grids

gridExtra::grid.arrange(plot_a,plot_b,
                        plot_c,plot_d,
                        plot_e,plot_f,
                        plot_g,plot_h,
                        ncol = 2)

a - It is a two-layered plot. It plots geom_point of median education against household income and smoking ban against population of 2017 on the same plot. It does work since it is plotting a categorial variable (x-axis) against a continuous variable (y-axis) in both geom_point and geom_boxplot. However, it does not make any sense because of the scale difference for y-axis. Household income ranges between approximately 20k and 130k whereas the pop2017 ranges between 88 and 100000k.
b - The second plot (placing the facet across columns) makes more sense because we are trying to compare poverty levels with median education. Unlike first plot, the y axis is shared across all median education facets, so its easy to read and understand the plot. If we want to compare the facet with the variable on y-axis, we should place a facet across columns and vice versa.

4 - Rental apartments in SF

Code
rent <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv')

rent_filtered <- rent |> 
  filter(city == "san francisco" & room_in_apt  == 0 & !is.na(beds) & beds != 0 ) |>
  mutate(rental_price_per_bedroom = round(price/beds,2))

shortlisted_neighborhoods <- c("SOMA / south beach",
                               "pacific heights",
                               "marina / cow hollow",
                               "nob hill",
                               "sea cliff",
                               "mission district",
                               "san francisco",
                               "russian hill")

rent_filtered <- rent_filtered |>
  filter(nhood %in% shortlisted_neighborhoods) |>
  group_by(year,nhood) |>
  summarise(avg_rental_price_per_bedroom = round(mean(rental_price_per_bedroom),2),
            nhood_cap = toupper(nhood))

ggplot(rent_filtered, aes(x =  year, y = avg_rental_price_per_bedroom)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  facet_wrap(~nhood_cap, dir = "v", nrow = 2) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(x = "Year",
       y = "Price",
       title = "Average Rental Price Per Bedroom Across Neighborhoods", 
       subtitle = "Top 8 Neighborhoods by Number of Available Rentals in San Francisco") +
  theme_minimal()

Since I had two continuous variables on both x and y axis, I decided to go with geom_point(). However, instead of plotting rental price per bedroom against years, which was too cluttery and did not depict any trend, I decided to de-clutter it by taking average rental price per bedroom which was a good predictor of the actual rental price in this case. Adding a linear plot to the graph (geom_smooth()) helped to understand the trend of average rental prices per bedroom across years.
I removed beds with 0 and NA values to avoid inf and NA in rental price per bedroom. I took the top 8 states with most number of rentals available to analyze a dataset with significant observations for more accurate results.
The plots show that rental prices have been increasing for all 8 states over the years. In some cases, like Marina/Cow Hills, Soma/South Beach, and Mission District, there has been a sharp increase in rental prices over the years. However, for the neighborhoods like San Francisco and Sea Cliff, the rental prices have been increasing gradually.

5 - Napoleon’s march.

Code
pacman::p_load(here)
napoleon_data <- read_rds(here("data","napoleon.rds"))

cities <- napoleon_data$cities
temperatures <- napoleon_data$temperatures
troops <- napoleon_data$troops

city_temp <- left_join(cities, temperatures, by = "long")
city_temp$temp_and_city <- paste0(city_temp$city, ", ", city_temp$temp)

ggplot() +
  geom_path(data = troops, aes(x = long,
                   y = lat,
                   group = group,
                   color = direction,
                   size = survivors),
            lineend = "round") + 
  scale_size("Survivors", range = c(1, 10)) +
  geom_text(data = cities, aes(x = long, y = lat, label = city), color = "black") +
  scale_colour_manual(values = c("yellow", "red")) +
  theme(legend.position = "none") +
  labs(x = "Longitude",
       y = "Latitude",
       title = "Napoleon's March to Russia",
       subtitle = "Minard's Visualization",
       caption = "Source: Charles Joseph Minard")

Reources: https://rpubs.com/Minh_Bui/257561. It gave me the idea of using geom_path instead of geom_line and how to add a geom_text layer for city names. I also checked how to set the aesthetics in geom_path.
The code uses two layers, geom_path and geom_text, to visualize Minard’s Napoleon March to Russia. The geom_path uses troops data to set the x-axis as longitude, y-axis as latitude which are grouped by ‘group’ variable. The color is set to direction, i.e. whether the soldiers were marching or retreating, and the size to survivors i.e. how many soldiers were left on the path. The scale_size just scales the number of soldiers left during the march and retreat. Geom_text() adds the city names to specific longitude and latitude as a second layer. Moreover, I have removed the legends and set the fig.width to 10 in chunks to increase the width of the plot for better visualization.
I have changed the colors of march and retreat to yellow and red respectively to show the sentiments of soldiers during the journey. Yellow depicts hope and optimism which was the sentiment initially that they will succeed in their siege of Moscow. Red depicts death and fear. The change of colors show that how their sentiments transitioned from optimism to fear during the journey as their army strength kept on decreasing.